DiskReduce: Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing

Authors

Bin Fan

Wittawat Tantisiriroj

Lin Xiao

Garth Gibson

Abstract

The first generation of Data-Intensive Scalable Computing file systems, such as Google File System and Hadoop Distributed File System, employed n-way replication (n ≥ 3) for high data reliability, therefore delivering users only about 1/n of the total storage capacity of the raw disks. This paper presents DiskReduce, a framework integrating RAID into these replicated storage systems to significantly reduce the storage capacity overhead, for example, from 200% to 25% when triplicated data is dynamically replaced with RAID sets (e.g. 8 + 2 RAID 6 encoding). Based on traces collected from Yahoo!, Facebook and OpenCloud clusters, we analyze (1) the capacity effectiveness of simple and not so simple strategies for grouping data blocks into RAID sets; (2) the implications of reducing the number of data copies on read performance and how to overcome the degradation; and (3) different heuristics to mitigate “small write penalties”. Finally, we introduce an implementation of our framework that has been built and submitted into the Apache Hadoop project.
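The overhead figures quoted in the abstract follow from simple arithmetic, which the short Python sketch below reproduces. It is only an illustration: the function names are hypothetical and are not taken from DiskReduce or Hadoop, and it assumes overhead is measured as redundant capacity divided by user data.

```python
from fractions import Fraction


def replication_overhead(copies):
    """Redundant capacity per unit of user data when keeping `copies` full copies."""
    return Fraction(copies - 1, 1)


def raid_overhead(data_blocks, parity_blocks):
    """Redundant capacity per unit of user data for a RAID set of data + parity blocks."""
    return Fraction(parity_blocks, data_blocks)


def usable_fraction(overhead):
    """Share of raw disk capacity that is left for user data at a given overhead."""
    return 1 / (1 + overhead)


if __name__ == "__main__":
    # Triplication versus the 8 + 2 RAID 6 example from the abstract.
    for label, overhead in [
        ("3 replicas", replication_overhead(3)),
        ("8 + 2 RAID 6", raid_overhead(8, 2)),
    ]:
        print(f"{label}: overhead {float(overhead):.0%}, "
              f"usable fraction {usable_fraction(overhead)}")
```

Under these assumptions, triplication gives an overhead of 2 (200%) and leaves 1/3 of raw capacity usable, while an 8 + 2 set gives an overhead of 1/4 (25%) and leaves 4/5 usable.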


Similar resources

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112)

Data-intensive file systems, developed for Internet services and popular in cloud computing, provide high reliability and availability by replicating data, typically three copies of everything. Alternatively, high performance computing, which has comparable scale, and smaller scale enterprise storage systems get similar tolerance for multiple failures from lower overhead erasure encoding, or RAI...


Data Replication-Based Scheduling in Cloud Computing Environment

Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...


DiskReduce: RAIDing the Cloud

Data-Intensive Scalable Computing (DISC) file systems such as HDFS employ replication for reliability, typically delivering users only about a third of the storage capacity of the raw disks. In this project, we investigate DiskReduce, a framework for integrating RAID into these replicated storage systems to lower storage capacity overhead, for example, from 200% to 25% when triplicated da...


Distributed File System Based on Erasure Coding for I/O-Intensive Applications

Distributed storage systems take advantage of network, storage and computational resources to provide a scalable infrastructure. But in such large systems, failures are frequent and expected. Data replication is the common technique for providing fault tolerance, but it suffers from high storage consumption. Erasure coding is an alternative that offers the same data protection but reduces ...


IStore: Towards High Efficiency, Performance, and Reliability in Distributed Data Storage with Information Dispersal Algorithms

Reliability is one of the major challenges for high performance computing and cloud computing. Data replication is a commonly used mechanism to achieve high reliability. Unfortunately, it has a low storage efficiency among other shortcomings. As an alternative to data replication, information dispersal algorithms offer higher storage efficiency, but at the cost of being too computing-intensive ...



Publication date: 2011
